Haplotype-Resolved, Chromosome-Level Assembly of White Clover (Trifolium repens L., Fabaceae)

Abstract White clover (Trifolium repens L.; Fabaceae) is an important forage and cover crop in agricultural pastures around the world and is increasingly used in evolutionary ecology and genetics to understand the genetic basis of adaptation. Historically, improvements in white clover breeding practices and assessments of genetic variation in nature have been hampered by a lack of high-quality genomic resources for this species, owing in part to its high heterozygosity and allotetraploid hybrid origin. Here, we use PacBio HiFi and chromosome conformation capture (Omni-C) technologies to generate a chromosome-level, haplotype-resolved genome assembly for white clover totaling 998 Mbp (scaffold N50 = 59.3 Mbp) and 1 Gbp (scaffold N50 = 58.6 Mbp) for haplotypes 1 and 2, respectively, with each haplotype arranged into 16 chromosomes (8 per subgenome). We additionally provide a functionally annotated haploid mapping assembly (968 Mbp, scaffold N50 = 59.9 Mbp), which drastically improves on the existing reference assembly in both contiguity and assembly accuracy. We annotated 78,174 protein-coding genes, resulting in protein BUSCO completeness scores of 99.6% and 99.3% against the embryophyta_odb10 and fabales_odb10 lineage datasets, respectively.


Introduction
White clover (Trifolium repens L., Fabaceae) is a prostrate, herbaceous perennial that spreads via stolons, forming large clonal patches up to 1 meter across (Burdon 1983). It originated as an allotetraploid in the Mediterranean 15-28 Ka resulting from the hybridization of its diploid progenitors, T. occidentale and T. pallescens (Williams et al. 2012;Griffiths et al. 2019). Because of its rapid growth and symbiosis with nitrogen-fixing bacteria, white clover is an important forage crop in agricultural pastures, and it has become naturalized in diverse climates around the world over the last several hundred years (Burdon 1983;Kjaergaard 2003). Today, there are large efforts to improve production and survival in variable environments, including traits such as yield and biomass production (Barrett et al. 2009;Moeskjaer et al. 2022), salt tolerance (Wang et al. 2010), drought tolerance (Annicchiarico and Piano 2004;Jiang et al. 2010), frost tolerance (Inostroza et al. 2018;Zhang et al. 2022), and disease resistance (Panter et al. 2012). Currently, most white clover breeding relies on phenotypic selection, although marker-assisted breeding designs are increasingly common and would be greatly facilitated by a well-annotated, chromosome-level reference genome assembly (Faville et al. 2012;Moeskjaer et al. 2022).
In addition to its use in agricultural mixed-grass pastures and breeding programs, white clover has become a model in evolutionary ecology and genetics for understanding adaptation to environmental gradients and agents of selection in nature. Early work documented latitudinal and altitudinal clines in the frequency of cyanogenesis, the production of hydrogen cyanide in response to tissue damage-an antiherbivore defense whose metabolic components can also affect tolerance to abiotic stressors (e.g., drought, frost) (Daday 1954a(Daday , 1954b(Daday , 1965(Daday , 1958Hughes 1991). More recent work has corroborated these widespread continental clines (Innes et al. 2022;Olsen 2012, 2013), uncovered clines on smaller spatial scales across urban-rural gradients , identified the molecular mechanisms underlying genetic variation in cyanogenesis (Olsen et al. 2007(Olsen et al. , 2008Olsen and Small 2018;Olsen et al. 2021), and experimentally tested the ecological factors maintaining the cyanogenesis polymorphism (Kooyers et al. 2014(Kooyers et al. , 2018Albano and Johnson 2023;Emad Fadoul et al. 2023) and its evolutionary consequences (Thompson and Johnson 2016;Santangelo et al. 2018). Although much of this work has focused on the cyanogenesis polymorphism -a trait with well-characterized inheritance attributable to two epistatically-interacting Mendelian loci-ongoing and future work will leverage white clover's rich history in evolutionary ecology to examine the genetic basis of adaptation at various spatial scales for which a high-quality reference assembly will be essential. In particular, a chromosome-level, haplotype-resolved assembly would facilitate identifying structural variants involved in adaptation (Battlay et al. 2023) and improve our understanding of the evolutionary consequences of polyploidization in this ecologically and agronomically important allotetraploid.
Owing to the inherently repetitive nature of polyploid genomes, chromosome-level and haplotype-resolved genome assemblies have been challenging for these taxa. However, new technologies allow us to span difficult repetitive elements and offer the ability to greatly improve and expand earlier genome assemblies. Here, we present a chromosome-level, haplotype-resolved genome assembly of the model legume white clover using PacBio HiFi and chromosome conformation capture (Dovetail Omni-C) technologies. We present two genomes as part of this project: 1) an unannotated, haplotype-resolved assembly and 2) a functionally annotated haploid mapping assembly, which we compare with the previous reference assembly (Griffiths et al. 2019) using two recently generated linkage maps for the species (Olsen et al. 2021).

Genome Assemblies
Our final haplotype-resolved assembly totaled 998,247,995 bp for haplotype 1 (N = 693 scaffolds total) and 1,009,398,733 bp for haplotype 2 (N = 1,022 scaffolds total) (table 1), slightly shorter than the ∼1.1 Gbp previously estimated genome size for T. repens species (Griffiths et al. 2019). Haplotype 1 had genome BUSCO completeness Our haploid mapping assembly totaled 968,215,888 bp assembled into 18 chromosomes (16 chromosomes + 2 organelles), with unplaced scaffolds removed from this assembly because our goal was to focus on assembled, contiguous chromosomes for this assembly. This resulted in the exclusion of 1,697 scaffolds from across both haplotypes, representing ∼30 Mbp of sequence (∼3% of genome). A total of 93.8% (N = 2,186) of the 2,330 filtered linkage markers for the "DG" mapping population mapped to the correct chromosome in our new assembly, which was a dramatic improvement compared with the 39.4% (N = 919) of the correctly mapped markers from the previous assembly ( fig. 1A). Similar results were obtained for the "SG" mapping population (supplementary fig. S2, Supplementary Material online). The haploid mapping assembly showed complete BUSCO scores of 99.6% and 99.5% against the embryophyta and fabales lineage datasets, respectively, with most of these genes occurring in duplicate copy (supplementary table S1, Supplementary Material online) Annotation We softmasked 59.4% (∼576.5 Mbp) of the haploid reference assembly ( fig. 1B) to improve gene model prediction during annotation. Of the classified repetitive elements, most (27.2%) were Long Terminal Repeats (i.e., LTRs) elements, with Ty1/Copia (13.5%) and Gypsy/DIRS1 (9.7%) elements making up the majority (supplementary table S2, Supplementary Material online). We annotated 78,174 genes consisting of 87,929 messenger RNA (mRNA) transcripts that together account for 23.2% of the genome (table 1 and fig. 1B). Thirty-nine thousand four hundred twenty-five of our annotated genes occur on the T. occidentale subgenome, with the remaining 38,749 on the T. pallescens subgenome, consistent with the number of genes of closely related diploid Trifolium species (T. pratense: 43,682; T. subterraneum: 42,704). Synteny between the subgenomes is largely preserved, except for three translocations between nonhomoeologous chromosomes and six inversions between homoeologous chromosomes ( fig. 1C). Of the 78,174 genes, 4,868 (6.2%) are completely overlapped by repeats and likely represent transposable element protein-coding sequences. Most mRNAs (∼87%; N = 77,043) had at least one functional annotation (table 1), with 15,871 genes containing common names. Our final annotated protein set had complete protein BUSCO scores of 99.6% and 99.3% against the embryophyta and fabales lineage datasets, respectively (supplementary table S1, Supplementary Material online).

Conclusion
We have provided a chromosome-level, haplotype-resolved genome assembly of the allotetraploid white clover (T. repens), and a functionally annotated haploid mapping assembly that shows substantial improvements over the existing reference genome for the species. These assemblies will facilitate marker-assisted breeding programs for traits of agronomic importance and provide increased resolution and ability to identify the genomic basis of adaptation in this increasingly used model in evolutionary ecology and genetics. Together with an alternative and upcoming chromosomelevel assembly (Wang et al. 2023) and other high-quality reference genomes in the genus (Dluhošová et al. 2018;Bickhart et al. 2022;Shirasawa et al. 2023), our haplotype-resolved assembly will be particularly useful for identifying structural variation and facilitate the development of pangenomic references (Eizenga et al. 2020) for which haplotype-resolved assemblies are an asset (Garg et al. 2022).

Plant Sample
We sequenced an F4 T. repens genotype that was generated as part of a separate experiment. As a diploidized allotetraploid, T. repens exhibits disomic inheritance with chromosomes from both subgenomes segregating independently, and plants are obligately outcrossing due to a gametophytic self-incompatibility. The sequenced plant originated from an F0 crosses between a plant from Ontario, Canada, and a plant from Louisiana, USA, followed by three generations of random crossing among the F1s, F2s, and finally the F3s. The sequenced plant was maintained in a 1 L pot in potting soil (Pro-Mix LP15; Premier Tech, Rivière-du-Loup, Canada) in a growth chamber set to 25 °C on a 12 h light:12 h dark cycle, though the plant was maintained in the dark for 48 h prior to sampling to reduce polysaccharide content. The plant was nondestructively harvested on March 28, 2022, by sampling approximately 2.5 g of leaf tissue, immediately flash-freezing tissue in liquid nitrogen, and storing it at -80 °C prior to shipping on dry ice to Dovetail Genomics for DNA extraction, library preparation, and sequencing. Sequencing DNA samples were quantified using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). The PacBio SMRTbell library (∼20 kbp mean insert length) for PacBio Sequel was constructed using SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer's recommended protocol. The library was bound to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II. Sequencing was performed on PacBio Sequel II 8 M SMRT cells generating 58 Gbp of data. These PacBio Circular Consensus Sequencing (i.e., CCS) reads were used as an input to "hifiasm" v0.16.1-r375 (Cheng et al. 2022(Cheng et al. , 2021) (see Scaffolding below).
For each Dovetail Omni-C library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DNAse I; chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends. After proximity ligation, crosslinks were reversed, and the DNA was purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments. Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotincontaining fragments were isolated using streptavidin beads before polymerase chain reaction enrichment of each library. The library was sequenced on an Illumina HiSeqX platform (Illumina, San Diego, California, USA) to produce ∼30 × sequence coverage. The PacBio CCS reads and Omni-C reads (MQ > 50) were then used as input for "hifiasm" to produce two haplotype-resolved assemblies (hap1 and hap2) using default parameters.

Scaffolding
We first produced an initial assembly of all PacBio HiFi data with "hifiasm" in the "primary" mode. This resulted in two sets of contigs: primary and alternative. We then combined primary and alternative contigs into a single set of all contigs, containing 1,384,338,092 bp of sequence in 6,189 contigs with N50 size of 15,304,949 bp, which we call the "unresolved" contig set below. Next, to determine which contig was derived from which subgenome, we used Illumina reads for the diploid parental species Trifolium occidentale (SRR8593471) and Trifolium pallescens (SRR8617466) downloaded from NCBI's Sequence Read Archive (SRA). We mapped the Illumina reads to the unresolved contig set with "bwa mem" (Li and Durbin 2009), used the best alignment for each Illumina read, counted the number of reads from each parental species mapped to each contig, and divided it by the total number of Illumina reads in each parental set. Based on the weighted number of alignments of the parental reads, we labeled the contigs in the unresolved set with "Pall" and "Occ" labels corresponding to the two subgenomes, resulting in a "labeled" set of contigs.
We then aligned the PacBio HiFi reads and the Omni-C reads to the labeled contigs with the "minimap2" (Li 2021) and "bwa mem" aligners, computed the best alignment for each read, and split the HiFi and Omni-C reads into subsets for each subgenome. We required that both Omni-C reads have the best alignment to the same subgenome to be assigned to that subgenome. Next, we assembled the two subsets of HiFi/Omni-C reads separately with "hifiasm" Hi-C in haplotype resolved mode. This yielded haplotype-resolved assemblies for the two subgenomes. We then scaffolded the assemblies with "HiRise" scaffolder (Putnam et al. 2016) and closed gaps in the scaffolds with "SAMBA" scaffolder (Zimin and Salzberg 2022). The final step was to remove redundant haplotype contigs that "hifiasm" sometimes keeps in the assembly. We did this by aligning all contigs shorter than 1 Mbp to the assembly for each of the haplotypes with "nucmer" (Marçais et al. 2018) and excluding the contigs that mapped to the interior of other bigger contigs with better than 95% similarity over at least 75% of their length. This resulted in the final set of assembled haplotypes (2 subgenomes × 2 haplotypes each = 4 haplotypes).

Assembly
All analyses from here forward are implemented in an open and reproducible Snakemake v7.16 pipeline (Mölder et al. 2021). The pipeline begins with input of the Dovetail haplotype assemblies, associated AGP (i.e., "A Golden Path") files and linkage map data from (Olsen et al. 2021), and ends with the generation of the phased diploid assembly in FASTA format (NCBI BioProjects PRJNA957817 and PRJNA957816), the annotated haploid mapping assembly in FASTA, NCBI Sequin, and GFF3 formats (BioProject PRJNA951196), and manuscript figures. See Data Accessibility for links to data and code.
Before assembling the reference genomes, the assembled haplotypes required manual curation to correct minor misassemblies and fill gaps generated during scaffolding. First, we used BLAST v2.12.0 (Altschul et al. 1990) to align two previously generated linkage maps for the species (Olsen et al. 2021) to each of the four haplotypes and to the previous T. repens reference genome (Griffiths et al. 2019). We removed alignments that were less than 175 bp in length of the 200 bp total length for each linkage mapping marker sequence and had less than 95% identity, and retained only the best alignment (i.e., lowest E-value) for each marker. These alignments were used to identify misassembled scaffolds and to assess correspondence between the scaffolds in the newly assembled haplotypes and the chromosomes in the previous reference genome.
Second, we used "Minimap2" v2.24 (Li 2021) to generate pairwise alignments between all four haplotypes. Together with the linkage map alignments above, these alignments enabled us to fill in three gaps (likely spanning the centromere) and one telomere with unplaced scaffolds (supplementary fig. S1, Supplementary Material online). In addition, the scaffolding generated a double telomere at the end of one of the chromosomes in the T. pallescens subgenome; this extra telomere was removed and added to its correct location at the end of the homoeologous chromosome in the T. occidentale subgenome (supplementary fig. S1, Supplementary Material online). All manual fixes were implemented in BioPython v1.8 (Cock et al. 2009).
We used the revised haplotypes to generate two separate reference genome assemblies: a haplotype-resolved assembly and a collapsed haploid mapping assembly. As a diploidized allotetraploid (see above), T. repens' four haplotypes can be collapsed into two haplotypes, each containing eight chromosomes from each subgenome (i.e., N = 16) resulting in a phased "diploid" assembly (i.e., 2N = 32). We therefore present this assembly as two FASTA files, with one for each of these two haplotypes. These FASTA files additionally include all unplaced scaffolds for each of the haplotypes. We additionally created a haploid mapping assembly, generated by taking the longer chromosome of each of the two haplotypes for each linkage group. This haploid mapping assembly was used for the structural and functional annotation described below. Both the diploid and haploid assemblies were checked for annotation completeness by running BUSCO v5.4.6 (Seppey et al. 2019) in "genome" mode against the embryophyta_odb10 and fabales_odb10 lineage datasets.

Structural Annotation
To improve gene-model predictions, we softmasked repeats prior to proceeding with annotation. First, we used "RepeatModeler" v2.0.3 (Flynn et al. 2020) to generate a repeat library using the haploid mapping reference as input. This database was then merged with RepBase (v. RedBaseRepeatMaskerEdition-20181026), and the combined repeat library was used to softmask repeats using "RepeatMasker" v4.1.3.
We predicted gene models and generated a structural annotation of the haploid mapping assembly by combining evidence from proteins in related plant species and RNA-Seq evidence in T. repens. First, we ran "BRAKER" v3.0.0  in "protein mode" using proteins from all green plants (i.e., Viridiplantae) as input, supplemented with proteins from all legumes (family: Fabaceae) from the UniProtKB database (N = 1,233,771 proteins). Next, we downloaded a subset of all RNA-Seq data from T. repens available from four published sources (Nagy et al. 2013;Griffiths et al. 2019;Zhou et al. 2021;Zhang et al. 2022), selected to represent diverse tissue types and library preparation protocols (N = 21 RNAseq libraries total, supplementary table S3, Supplementary Material online). We mapped the raw RNA-Seq reads to the haploid mapping reference using "STAR" v2.7.0b (Dobin et al. 2013) in "two-pass mode" and merged the resulting BAM files using SAMtools v1.16.1 ). We used this merged BAM file as input to "BRAKER" in "RNAseq" mode. Next, we combined evidence from "protein" and "RNAseq" modes using "TSEBRA" (Gabriel et al. 2021) before using the "agat_convert_gxf2gxf.pl" script from "AGAT" v1.0.0-pl5321hdfd78af_0 (Dainat et al. 2022) to convert the BRAKER-generated GTF to GFF3 format and proceeding to functional annotation.

Functional Annotation
We added functional annotations by querying extracted proteins against numerous databases prior to merging and formatting annotations for uploading to NCBI. First, we retrieved functional annotations using "InterProScan" v5.61.93-0 (Jones et al. 2014) and "Eggnog-mapper" v2.1.10 (Cantalapiedra et al. 2021). The resulting outputs were passed as input to "funannotate" v1.8.14 (Palmer and Stajich 2019), which combined the annotations and queried some additional databases. For any protein annotated as "hypothetical protein" and containing a fully resolved enzyme commission (i.e., EC) number (i.e., resolved to four digits), we replaced the "hypothetical protein" annotation with the EC number's product in the ExPASSY Enzyme database (Bairoch 2000). If the EC number was only resolved to three digits or fewer, we kept the "hypothetical protein" annotation and removed the EC number. In the end, the following databases were queried for annotations: InterPro v93.0 (Blum et al. 2021 (Mistry et al. 2021), GO v2023-03-06 (Ashburner et al. 2000;Gene Ontology Consortium 2021), and MiBig v1.4 (Terlouw et al. 2023). We queried our final annotation against the embryophyta_odb10 and fabales_odb10 BUSCO databases using BUSCO v5.4.6 (Seppey et al. 2019) in "protein" mode and assessed synteny between the T. occidentale and T. pallescens subgenomes using GENESPACE with default parameters except "useHOGs" and "orthofinderInBlk" were set to TRUE (Lovell et al. 2022).

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Data Availability
All code to reproduce this manuscript's results can be found on JSS's GitHub (see https://github.com/James-S-Santangelo/dcg) and is archived on Zenodo (https:// zenodo.org/record/8,180,534). In addition to code, the Zenodo repository contains the raw haplotypes assembled by Dovetail and the AGP files required as input to the pipeline. All other data are contained within the GitHub/Zenodo repository, except for proprietary databases (e.g., RepBase) that could not be included (see GitHub README). The raw data used in the assemblies have been deposited on NCBI (BioProject PRJNA979795) along with the annotated haploid mapping assembly (BioProject PRJNA951196) and individual haplotype assemblies (BioProjects PRJNA957816 and PRJNA957817).